As marketers, helping search engines answer that basic question is one of our most important tasks. Search engines can’t read pages like humans can, so we incorporate structure and clues as to what our content means. This helps provide the relevance element of search engine optimization that matches queries to useful results.
Understanding the techniques used to capture this meaning helps to provide better signals as to what our content relates to, and ultimately helps it to rank higher in search results. This post explores a series of on-page techniques that not only build upon one another, but can be combined in sophisticated ways.
While Google doesn’t reveal the exact details of its algorithm, over the years we’ve collected evidence from interviews, research papers, US patent filings and observations from hundreds of search marketers to be able to explore these processes. Special thanks to Bill Slawski, whose posts on SEO By the Sea led to much of the research for this work.
As you read, keep in mind these are only some of the ways in which Google could determine on-page relevancy, and they aren’t absolute law! Experimenting on your own is always the best policy.
We’ll start with the simple, and move to the more advanced.
1. Keyword Usage

In the beginning, there were keywords. All over the page.
The concept was this: If your page focused on a certain topic, search engines would discover keywords in important areas. These locations included the title tag, headlines, alt attributes of images, and throughout in the text. SEOs helped their pages rank by placing keywords in these areas.
Even today, we start with keywords, and it remains the most basic form of on-page optimization.
Most on-page SEO tools still rely on keyword placement to grade pages, and while it remains a good place to start, research shows its influence has fallen.
While it’s important to ensure your page at a bare minimum contains the keywords you want to rank for, it is unlikely that keyword placement by itself will have much of an influence on your page’s ranking potential.
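As a starting point, checking keyword placement can be automated. Below is a minimal, illustrative sketch that scans an HTML string for a keyword in the classic locations (title tag, headings, image alt text, and body copy). The function name, regexes, and sample page are all invented for illustration; this is a rough audit aid, not how any search engine actually evaluates a page.

```python
import re

def keyword_placement(html, keyword):
    """Report whether a keyword appears in common on-page locations."""
    kw = re.escape(keyword)
    locations = {
        "title":    r"<title[^>]*>(.*?)</title>",
        "headings": r"<h[1-6][^>]*>(.*?)</h[1-6]>",
        "alt_text": r'alt="([^"]*)"',
    }
    report = {}
    for name, pattern in locations.items():
        texts = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)
        report[name] = any(re.search(kw, t, re.IGNORECASE) for t in texts)
    # Strip all tags for a rough body-text check
    body = re.sub(r"<[^>]+>", " ", html)
    report["body"] = bool(re.search(kw, body, re.IGNORECASE))
    return report

page = """<html><head><title>Dog Photos Gallery</title></head>
<body><h1>Dog photos</h1><img src="d.jpg" alt="dog photo">
<p>Our favorite dog photos, updated weekly.</p></body></html>"""
print(keyword_placement(page, "dog photos"))
```

A report like this only tells you the keywords are present; as noted above, presence alone is a baseline, not a ranking strategy.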
2. TF-IDF
It’s not keyword density, it’s term frequency–inverse document frequency (TF-IDF).
Google researchers recently described TF-IDF as “long used to index web pages” and variations of TF-IDF appear as a component in several well-known Google patents.
TF-IDF doesn’t simply measure how often a keyword appears; it measures a term’s importance by comparing its frequency in a document against expectations derived from a larger set of documents.
If we compare the phrases “basket” and “basketball player” in Google’s Ngram viewer, we see that “basketball player” is rarer, while “basket” is more common. Based on this frequency, we might conclude that “basketball player” is significant on a page that contains that term, while the bar for “basket” to appear significant remains much higher.
For SEO purposes, when we measure TF-IDF’s correlation with higher rankings, it performs only moderately better than individual keyword usage. In other words, generating a high TF-IDF score by itself generally isn’t enough to expect much of an SEO boost. Instead, we should think of TF-IDF as an important component of other more advanced on-page concepts.
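To make the idea concrete, here is a minimal TF-IDF sketch over a toy corpus. The corpus, tokenizer, and smoothing are all illustrative assumptions; real implementations vary, and Google’s actual weighting is not public. It shows why the rare term “basketball” scores higher than the common term “basket”, even when both appear once in the same document.

```python
import math

def tf_idf(term, document, corpus):
    """Toy TF-IDF: term frequency times log-scaled inverse document frequency."""
    tokens = document.lower().split()
    tf = tokens.count(term) / len(tokens)           # term frequency in this document
    containing = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / (1 + containing))  # rare terms get a higher weight
    return tf * idf

corpus = [
    "the basket was full of apples",
    "a basketball player needs a good basket technique",
    "the player passed the ball",
    "apples in a basket on the table",
]
doc = corpus[1]
print(round(tf_idf("basketball", doc, corpus), 3))  # rare term, higher score
print(round(tf_idf("basket", doc, corpus), 3))      # common term, lower score
```

Both terms occur once in the document, so their term frequencies are equal; the difference comes entirely from the IDF component, which is the point of the technique.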
3. Synonyms and Close Variants
With over 6 billion searches per day, Google has a wealth of information to determine what searchers actually mean when typing queries into a search box. Google’s own research shows that synonyms play a role in up to 70% of searches.
To solve this problem, search engines possess vast corpuses of synonyms and close variants for billions of phrases, which allows them to match content to queries even when searchers use different words than your text. An example is the query dog pics, which can mean the same thing as:
• Dog Photos
• Pictures of Dogs
• Dog Pictures
• Canine Photos
• Dog Photographs
On the other hand, the query Dog Motion Picture means something else entirely, and it’s important for search engines to know the difference.
From an SEO point of view, this means creating content using natural language and variations, instead of employing the same strict keywords over and over again.
Using variations of your main topics can also add deeper semantic meaning and help solve the problem of disambiguation, when the same keyword phrase can refer to more than one concept. Plant and factory together might refer to a manufacturing plant, whereas plant and shrub refer to vegetation.
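The disambiguation idea above can be sketched with a few lines of code: score each candidate sense of an ambiguous term by how many of its associated context words co-occur on the page. The sense lexicon and scoring below are invented for illustration; real systems learn these associations from massive corpora rather than hand-built lists.

```python
# Hypothetical cue words for each sense of the ambiguous term "plant"
SENSES = {
    "manufacturing": {"factory", "machinery", "workers", "production"},
    "vegetation":    {"shrub", "leaf", "soil", "garden"},
}

def disambiguate(text):
    """Pick the sense whose cue words overlap most with the page's words."""
    words = set(text.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    return max(scores, key=scores.get), scores

page = "the plant manager toured the factory floor with the production workers"
print(disambiguate(page))
```

Even this toy version shows why surrounding vocabulary matters: the word “plant” alone is ambiguous, but the words it co-occurs with are not.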
Today, Google’s Hummingbird algorithm also uses co-occurrence to identify synonyms for query replacement. Under Hummingbird, co-occurrence identifies words that may be synonyms of each other in certain contexts, following rules under which a page selected in response to a query where such a substitution has taken place has a heightened probability of being relevant.
4. Page Segmentation

Where you place your words on a page is often as important as the words themselves.
Each web page is made up of different parts: headers, footers, sidebars, and more. Search engines have long worked to determine the most important part of a given page. Both Microsoft and Google hold several patents suggesting that content in the more relevant sections of HTML carries more weight.
Content located in the main body text likely holds more importance than text placed in sidebars or alternative positions. Repeating text placed in boilerplate locations, or chrome, runs the risk of being discounted even more.
Page segmentation becomes significantly more important as we move toward mobile devices, which often hide portions of the page. Search engines want to serve users the portions of your pages that are visible and important, so text in these areas deserves the most focus.
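One way to picture page segmentation is to weight each word by the HTML section it sits in. The sketch below, built on Python’s standard-library `HTMLParser`, gives main-content text more weight than boilerplate such as `nav` or `footer`. The specific weights are invented; the patents only suggest that main content counts more than chrome, not by how much.

```python
from html.parser import HTMLParser
from collections import Counter

# Illustrative weights: main content boosted, boilerplate ("chrome") discounted
WEIGHTS = {"main": 3.0, "article": 3.0, "nav": 0.5, "footer": 0.5, "aside": 0.5}

class SegmentWeigher(HTMLParser):
    """Accumulate per-word scores, weighted by the enclosing section."""
    def __init__(self):
        super().__init__()
        self.stack = []          # currently open sectioning tags
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        weight = WEIGHTS.get(self.stack[-1], 1.0) if self.stack else 1.0
        for word in data.lower().split():
            self.scores[word] += weight

html = """<nav>home about</nav>
<main><p>dog photos and dog pictures</p></main>
<footer>contact home</footer>"""
p = SegmentWeigher()
p.feed(html)
print(p.scores["dog"], p.scores["home"])  # main-body word outscores boilerplate word
```

Note how “dog”, appearing twice in the main body, ends up with a far higher score than “home”, which appears twice but only in boilerplate sections.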
To take it a step further, HTML5 offers additional semantic elements such as